(54) Method and system for aligning natural and synthetic video to speech synthesis



(57) According to MPEG-4's TTS architecture, facial animation can be
driven by two streams simultaneously: text and Facial Animation
Parameters. In this architecture, text input is sent to a
Text-To-Speech converter at a decoder that drives the mouth shapes of
the face. Facial Animation Parameters are sent from an encoder to the
face over the communication channel. The present invention includes
codes (known as bookmarks) in the text string transmitted to the
Text-to-Speech converter, which bookmarks are placed between words as
well as inside them. According to the present invention, the bookmarks
carry an encoder time stamp. Due to the nature of text-to-speech
conversion, the encoder time stamp does not relate to real-world time,
and should be interpreted as a counter. In addition, the Facial
Animation Parameter stream carries the same encoder time stamp found in
the bookmark of the text. The system of the present invention reads the
bookmark and provides the encoder time stamp as well as a real-time
time stamp to the facial animation system. Finally, the facial
animation system associates the correct facial animation parameter with
the real-time time stamp using the encoder time stamp of the bookmark
as a reference.



[FIG. 2: block diagram of the decoder. Legible block labels: MPEG-4
hybrid TTS, TTS-FA interface, phoneme FAP converter, FA/AP/MP,
lip-shape analyzer, audio decoder, visual decoder, compositor,
speakers/headphones, display terminal. Legible signal labels: coded
speech, speech, video data, user input.]



Printed by Jouve, 75001 PARIS (FR) 



BNSDOCID: <EP 0896322A2_I_>

EP 0 896 322 A2



Description 

BACKGROUND OF THE INVENTION

[0001] The present invention relates generally to methods and systems
for coding of images, and more particularly to a method and system for
coding images of facial animation.
[0002] According to MPEG-4's TTS architecture, facial animation can be
driven by two streams simultaneously: text and Facial Animation
Parameters (FAPs). In this architecture, text input is sent to a
Text-To-Speech (TTS) converter at a decoder that drives the mouth
shapes of the face. FAPs are sent from an encoder to the face over the
communication channel. Currently, the Verification Model (VM) assumes
that synchronization between the input side and the FAP input stream is
obtained by means of timing injected at the transmitter side. However,
the transmitter does not know the timing of the decoder TTS. Hence, the
encoder cannot specify the alignment between synthesized words and the
facial animation. Furthermore, timing varies between different TTS
systems. Thus, there currently is no method of aligning facial mimics
(e.g., smiles and expressions) with speech.

[0003] The present invention is therefore directed to the problem of
developing a system and method for coding images for facial animation
that enables alignment of facial mimics with speech generated at the
decoder.

SUMMARY OF THE INVENTION 

[0004] The present invention solves this problem by including codes
(known as bookmarks) in the text string transmitted to the
Text-to-Speech (TTS) converter, which bookmarks can be placed between
words as well as inside them. According to the present invention, the
bookmarks carry an encoder time stamp (ETS). Due to the nature of
text-to-speech conversion, the encoder time stamp does not relate to
real-world time, and should be interpreted as a counter. In addition,
according to the present invention, the Facial Animation Parameter
(FAP) stream carries the same encoder time stamp found in the bookmark
of the text. The system of the present invention reads the bookmark and
provides the encoder time stamp as well as a real-time time stamp (RTS)
derived from the timing of its TTS converter to the facial animation
system. Finally, the facial animation system associates the correct
facial animation parameter with the real-time time stamp using the
encoder time stamp of the bookmark as a reference. In order to prevent
conflicts between the encoder time stamps and the real-time time
stamps, the encoder time stamps have to be chosen such that a wide
range of decoders can operate.
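The association step described above can be sketched as follows. This is only an illustration under stated assumptions: the `Fap` record, the `schedule_faps` helper, the mimic names, and the representation of the TTS report as a dictionary from ETS to RTS in milliseconds are all hypothetical and are not defined by the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fap:
    """Hypothetical FAP record; only the ETS field is taken from the text."""
    ets: int   # encoder time stamp: a counter, not wall-clock time
    name: str  # illustrative label for the facial mimic


def schedule_faps(bookmark_times, faps):
    """Pair each FAP with the RTS of the bookmark carrying the same ETS.

    bookmark_times maps ETS -> RTS (ms), as reported by the local TTS.
    FAPs whose ETS has not been reported yet are left out.
    """
    scheduled = [(bookmark_times[f.ets], f) for f in faps if f.ets in bookmark_times]
    return sorted(scheduled, key=lambda pair: pair[0])


# Assumed example: the TTS reached bookmark 7 at 120 ms and bookmark 8 at 940 ms.
bookmark_times = {7: 120, 8: 940}
faps = [Fap(8, "raise_eyebrows"), Fap(7, "smile"), Fap(9, "blink")]
schedule = schedule_faps(bookmark_times, faps)
```

The ETS is used purely as a lookup key here, which matches the statement that it should be interpreted as a counter; all playback timing comes from the decoder's own RTS values.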

[0005] Therefore, in accordance with the present invention, a method
for encoding a facial animation including at least one facial mimic and
speech in the form of a text stream comprises the steps of assigning a
predetermined code to the at least one facial mimic, and placing the
predetermined code within the text stream, wherein said code indicates
a presence of a particular facial mimic. The predetermined code is a
unique escape sequence that does not interfere with the normal
operation of a text-to-speech synthesizer.
[0006] One possible embodiment of this method uses the predetermined
code as a pointer to a stream of facial mimics, thereby indicating a
synchronization relationship between the text stream and the facial
mimic stream.
[0007] One possible implementation of the predetermined code is an
escape sequence, followed by a plurality of bits, which define one of a
set of facial mimics. In this case, the predetermined code can be
placed in between words in the text stream, or in between letters in
the text stream.
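As a rough illustration of this encoding step, the sketch below inserts a code either between words or between letters. The escape character, the 4-bit code table, and the choice to write the bits as the characters "0" and "1" are assumptions made for readability; the specification does not fix any of them.

```python
ESCAPE = "\x1b"  # hypothetical escape character, assumed not to occur in normal text
MIMIC_CODES = {"smile": 0b0001, "frown": 0b0010}  # hypothetical 4-bit code table


def encode_mimic(mimic: str) -> str:
    """Return the escape character followed by the mimic's bits, written as '0'/'1'."""
    return ESCAPE + format(MIMIC_CODES[mimic], "04b")


def insert_code(text: str, char_index: int, mimic: str) -> str:
    """Insert the code for `mimic` before the character at `char_index`."""
    return text[:char_index] + encode_mimic(mimic) + text[char_index:]


between_words = insert_code("hello world", 6, "smile")  # code sits before "world"
inside_word = insert_code("hello world", 3, "frown")    # code sits between "hel" and "lo"
```

Because the insertion point is a plain character offset, the same helper covers both placements mentioned in the text.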

[0008] Another method according to the present invention for encoding
a facial animation includes the steps of creating a text stream,
creating a facial mimic stream, inserting a plurality of pointers in
the text stream pointing to a corresponding plurality of facial mimics
in the facial mimic stream, wherein said plurality of pointers
establish a synchronization relationship with said text and said facial
mimics.

[0009] According to the present invention, a method for decoding a
facial animation including speech and at least one facial mimic
includes the steps of monitoring a text stream for a set of
predetermined codes corresponding to a set of facial mimics, and
sending a signal to a visual decoder to start a particular facial mimic
upon detecting the presence of one of the set of predetermined codes.
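The decoding method above amounts to scanning the text stream and raising a signal whenever a known code appears. A minimal sketch, assuming a hypothetical escape character followed by four bit characters and a callback standing in for the signal to the visual decoder:

```python
ESCAPE = "\x1b"                             # hypothetical escape character
CODE_BITS = 4                               # hypothetical code width
MIMICS = {"0001": "smile", "0010": "frown"}  # hypothetical code table


def monitor(text_stream, start_mimic):
    """Scan the stream; call start_mimic(name) for each known code.

    Returns the text with all codes removed, i.e. what the speech
    synthesizer would actually speak. Unknown codes are skipped silently.
    """
    plain = []
    i = 0
    while i < len(text_stream):
        if text_stream[i] == ESCAPE:
            bits = text_stream[i + 1:i + 1 + CODE_BITS]
            if bits in MIMICS:
                start_mimic(MIMICS[bits])
            i += 1 + CODE_BITS
        else:
            plain.append(text_stream[i])
            i += 1
    return "".join(plain)


signals = []  # stands in for the visual decoder's input
spoken = monitor("hel\x1b0010lo \x1b0001world", signals.append)
```

Passing the signal sink as a callback keeps the scanner independent of how the visual decoder is actually driven.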

[0010] According to the present invention, an apparatus for decoding
an encoded animation includes a demultiplexer receiving the encoded
animation, outputting a text stream and a facial animation parameter
stream, wherein said text stream includes a plurality of codes
indicating a synchronization relationship with a plurality of mimics in
the facial animation parameter stream and the text in the text stream,
a text to speech converter coupled to the demultiplexer, converting the
text stream to speech, outputting a plurality of phonemes, and a
plurality of real-time time stamps and the plurality of codes in a
one-to-one correspondence, whereby the plurality of real-time time
stamps and the plurality of codes indicate a synchronization
relationship between the plurality of mimics and the plurality of
phonemes, and a phoneme to video converter being coupled to the text to
speech converter, synchronizing a plurality of facial mimics with the
plurality of phonemes based on the plurality of real-time time stamps
and the plurality of codes.
[0011] In the above apparatus, it is particularly advantageous if the
phoneme to video converter includes a facial animator creating a
wireframe image based on the synchronized plurality of phonemes and the
plurality of facial mimics, and a visual decoder being coupled to the
demultiplexer and the facial animator, and rendering the video image
based on the wireframe image.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] FIG. 1 depicts the environment in which the present invention
will be applied.
[0013] FIG. 2 depicts the architecture of an MPEG-4 decoder using
text-to-speech conversion.

DETAILED DESCRIPTION

[0014] According to the present invention, the synchronization of the
decoder system can be achieved by using local synchronization by means
of event buffers at the input of FA/AP/MP and the audio decoder.
Alternatively, a global synchronization control can be implemented.
[0015] A maximum drift of 80 msec between the encoder time stamp (ETS)
in the text and the ETS in the Facial Animation Parameter (FAP) stream
is tolerable.
[0016] One embodiment for the syntax of the bookmarks when placed in
the text stream consists of an escape signal followed by the bookmark
content, e.g., \|M{bookmark content}. The bookmark content carries a
16-bit integer time stamp ETS and additional information. The same ETS
is added to the corresponding FAP stream to enable synchronization. The
class of Facial Animation Parameters is extended to carry the optional
ETS.
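A parser for this bookmark syntax might look like the sketch below. The escape signal is copied from the example as printed above; the internal layout of the bookmark content (a decimal ETS optionally followed by a comma and free-form additional information) is an assumption, since the text only says that the content carries a 16-bit ETS and additional information.

```python
import re

# Escape signal as printed in the example; the "ETS[,extra]" layout is assumed.
BOOKMARK = re.compile(r"\\\|M\{(\d+)(?:,([^}]*))?\}")


def split_bookmarks(text):
    """Separate speakable text from bookmarks.

    Returns (plain_text, bookmarks), where each bookmark is a tuple
    (offset_in_plain_text, ets, extra).
    """
    plain_parts, bookmarks = [], []
    pos = 0        # read position in the annotated text
    plain_len = 0  # length of the plain text emitted so far
    for match in BOOKMARK.finditer(text):
        chunk = text[pos:match.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        ets = int(match.group(1))
        if not 0 <= ets <= 0xFFFF:
            raise ValueError(f"ETS {ets} does not fit in 16 bits")
        bookmarks.append((plain_len, ets, match.group(2) or ""))
        pos = match.end()
    plain_parts.append(text[pos:])
    return "".join(plain_parts), bookmarks


plain, marks = split_bookmarks(r"Hello \|M{12,smile}wor\|M{13}ld")
```

Recording each bookmark's offset in the plain text lets a TTS front end report the real-time time stamp at which that position is reached, whether the bookmark sits between words or inside one.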

[0017] If an absolute clock reference (PCR) is provided, a drift
compensation scheme can be implemented. Note that there is no
master-slave notion between the FAP stream and the text. This is
because the decoder might decide to vary the speed of the text, and a
variation of the facial animation might become necessary if an avatar
reacts to visual events happening in its environment.

[0018] For example, suppose avatar 1 is talking to the user and a new
avatar enters the room. A natural reaction of avatar 1 is to look at
avatar 2, smile, and, while doing so, slow down the speed of the spoken
text.

Autonomous Animation Driven Mostly by Text

[0019] In the case of facial animation driven by text, the additional
animation of the face is mostly restricted to events that do not have
to be animated at a rate of 30 frames per second. Especially high-level
action units like smile should be defined at a much lower rate.
Furthermore, the decoder can do the interpolation between different
action units without tight control from the receiver.

[0020] The present invention includes action units to be animated and
their intensity in the additional information of the bookmarks. The
decoder is required to interpolate between the action units and their
intensities between consecutive bookmarks.
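The interpolation requirement can be illustrated with a short sketch. Linear interpolation is an assumption made here for simplicity; the text does not prescribe an interpolation law. The keyframe representation (a time in milliseconds and an intensity between 0 and 1 for a single action unit) is likewise hypothetical.

```python
def interpolate_intensity(keyframes, t_ms):
    """Linearly interpolate an action unit's intensity at time t_ms.

    keyframes is a list of (rts_ms, intensity) pairs sorted by time, one
    per bookmark. Times outside the covered range clamp to the nearest
    keyframe.
    """
    if t_ms <= keyframes[0][0]:
        return keyframes[0][1]
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t0 <= t_ms <= t1:
            return v0 + (v1 - v0) * (t_ms - t0) / (t1 - t0)
    return keyframes[-1][1]


# Assumed example: "smile" ramps up over 400 ms, then relaxes to half intensity.
smile = [(0, 0.0), (400, 1.0), (1000, 0.5)]
frames = [interpolate_intensity(smile, t) for t in (0, 200, 400, 700, 1200)]
```

Only three bookmarks are transmitted in this example, yet the decoder can produce an intensity for every rendered frame, which is where the bandwidth saving comes from.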
[0021] This provides the advantages of authoring animations using
simple tools, such as text editors, and significant savings in
bandwidth.
[0022] FIG. 1 depicts the environment in which the present invention
is to be used. The animation is created and coded in the encoder
section 1. The encoded animation is then sent through a communication
channel (or storage) to a remote destination. At the remote
destination, the animation is recreated by the decoder 2. At this
stage, the decoder 2 must synchronize the facial animations with the
speech of the avatar using only information encoded with the original
animation.
[0023] FIG. 2 depicts the MPEG-4 architecture of the decoder, which
has been modified to operate according to the present invention. The
signal from the encoder 1 (not shown) enters the Demultiplexer (DMUX) 3
via the transmission channel (or storage, which can also be modeled as
a channel). The DMUX 3 separates out the text and the video data, as
well as the control and auxiliary information. The FAP stream, which
includes the Encoder Time Stamp (ETS), is also output by the DMUX 3
directly to the FA/AP/MP 4, which is coupled to the Text-to-Speech
Converter (TTS) 5, a Phoneme FAP Converter 6, a compositor 7 and a
visual decoder 8. A Lip Shape Analyzer 9 is coupled to the visual
decoder 8 and the TTS 5. User input enters via the compositor 7 and is
output to the TTS 5 and the FA/AP/MP 4. These events include start,
stop, etc.

[0024] The TTS 5 reads the bookmarks, and outputs the phonemes along
with the ETS as well as with a Real-time Time Stamp (RTS) to the
Phoneme FAP Converter 6. The phonemes are used to put the vertices of
the wireframe in the correct places. At this point the image is not
rendered.

[0025] This data is then output to the visual decoder 8, which renders
the image, and outputs the image in video form to the compositor 7. It
is in this stage that the FAPs are aligned with the phonemes by
synchronizing the phonemes with the same ETS/RTS combination with the
corresponding FAP with the matching ETS.
[0026] The text input to the MPEG-4 hybrid text-to-speech (TTS)
converter 5 is output as coded speech to an audio decoder 10. In this
system, the audio decoder 10 outputs speech to the compositor 7, which
acts as the interface to the video display (not shown) and the speakers
(not shown), as well as to the user.
[0027] On the video side, video data output by the DMUX 3 is passed to
the visual decoder 8, which creates the composite video signal based on
the video data and the output from the FA/AP/MP 4.
[0028] There are two different embodiments of the present invention.
In a first embodiment, the ETS placed in the text stream includes the
facial animation. That is, the bookmark (escape sequence) is followed
by a 16 bit codeword that represents the appropriate facial animation
to be synchronized with the speech at this point in the animation.

[0029] Alternatively, the ETS placed in the text stream can act as a
pointer in time to a particular facial animation in the FAP stream.
Specifically, the escape sequence is followed by a 16 bit code that
uniquely identifies a particular place in the FAP stream.
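The difference between the two embodiments can be sketched as two ways of resolving the same 16 bit value. Both helpers, the codeword table, and the representation of the FAP stream as (ETS, animation) pairs are hypothetical and serve only to contrast the two readings.

```python
MIMIC_TABLE = {0x0001: "smile", 0x0002: "frown"}  # hypothetical codeword table


def check_16_bit(value: int) -> int:
    """Reject values that cannot be carried in a 16 bit field."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError(f"{value} does not fit in 16 bits")
    return value


def resolve_inline(codeword: int) -> str:
    """First embodiment: the codeword itself names the facial animation."""
    return MIMIC_TABLE[check_16_bit(codeword)]


def resolve_pointer(code: int, fap_stream) -> str:
    """Second embodiment: the code identifies a place in the FAP stream.

    fap_stream is an iterable of (ets, animation) pairs; the entry whose
    ETS equals the code is returned.
    """
    index = dict(fap_stream)
    return index[check_16_bit(code)]


inline_result = resolve_inline(0x0001)
pointer_result = resolve_pointer(300, [(299, "blink"), (300, "nod")])
```

In the inline case the text stream alone is enough to drive the face, while the pointer case keeps the animation data in the FAP stream and uses the text stream only for timing.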
[0030] While the present invention has been de- 
scribed in terms of animation data, the animation data 
could be replaced with natural audio or video data. More 
specifically, the above description provides a method 
and system for aligning animation data with text-to- 
speech data. However, the same method and system 
applies if the text-to-speech data is replaced with audio 
or video. In fact, the alignment of the two data streams 
is independent of the underlying data, at least with re- 
gard to the TTS stream. 

Claims 

1. A method for encoding a facial animation including at least one
facial mimic and speech in the form of a text stream, comprising the
steps of:

a) assigning a predetermined code to the at least one facial mimic; and

b) placing the predetermined code within the text stream, wherein said
code indicates a presence of a particular facial mimic.

2. The method according to claim 1, wherein the predetermined code acts
as a pointer to a stream of facial mimics thereby indicating a
synchronization relationship between the text stream and the facial
mimic stream.

3. The method according to claim 1, wherein the predetermined code
comprises an escape sequence followed by a plurality of bits, which
define one of a set of possible facial mimics.

4. The method according to claim 1, further comprising the step of
placing the predetermined code in between words in the text stream.

5. The method according to claim 1, further comprising the step of
placing the predetermined code in between letters in the text stream.

6. The method according to claim 1, further comprising the step of
placing the predetermined code inside words in the text stream.

7. A method for encoding a facial animation comprising the steps of:

a) creating a data stream;

b) creating a facial mimic stream;

c) inserting a plurality of pointers in the data stream pointing to a
corresponding plurality of facial mimics in the facial mimic stream,
wherein said plurality of pointers establish a synchronization
relationship with said data and said facial mimics.

8. The method according to claim 7, wherein each of the plurality of
pointers comprises a time stamp.

9. The method according to claim 7, wherein the data stream comprises a
text stream that is to be converted to speech in a decoding process.

10. The method according to claim 9, further comprising the step of
placing at least one of the plurality of pointers in between words in
the text stream.

11. The method according to claim 9, further comprising the step of
placing at least one of the plurality of pointers in between syllables
in the text stream.

12. The method according to claim 7, further comprising the step of
placing at least one of the plurality of pointers inside words in the
text stream.

13. The method according to claim 7, wherein the data stream comprises
a video stream.

14. The method according to claim 7, wherein the data stream comprises
an audio stream.

15. A method for decoding a facial animation including speech and at
least one facial mimic, comprising the steps of:

a) monitoring a text stream for a set of predetermined codes
corresponding to a set of facial mimics; and

b) sending a signal to a visual decoder to start a particular facial
mimic upon detecting the presence of one of the set of predetermined
codes.

16. The method according to claim 15, wherein the predetermined code
acts as a pointer to a stream of facial mimics thereby indicating a
synchronization relationship between the text stream and the facial
mimic stream.

17. The method according to claim 15, wherein the predetermined code
comprises an escape sequence.

18. The method according to claim 15, further comprising the step of
placing the predetermined code in between words in the text stream.

19. The method according to claim 15, further comprising the step of
placing the predetermined code in between phonemes in the text stream.




20. The method according to claim 15, further comprising the step of
placing the predetermined code inside words in the text stream.

21. An apparatus for decoding an encoded animation comprising:

a) a demultiplexer receiving the encoded animation, outputting a text
stream and a facial animation parameter stream, wherein said text
stream includes a plurality of codes indicating a synchronization
relationship with a plurality of mimics in the facial animation
parameter stream and the text in the text stream;

b) a text to speech converter coupled to the demultiplexer, converting
the text stream to speech, outputting a plurality of phonemes, and a
plurality of real-time time stamps and the plurality of codes in a
one-to-one correspondence, whereby the plurality of real-time time
stamps and the plurality of codes indicate a synchronization
relationship between the plurality of mimics and the plurality of
phonemes; and

c) a phoneme to video converter being coupled to the text to speech
converter, synchronizing a plurality of facial mimics with the
plurality of phonemes based on the plurality of real-time time stamps
and the plurality of codes.

22. The apparatus according to claim 21, further comprising a
compositor converting the speech and video to a composite video signal.

23. The apparatus according to claim 21, wherein the phoneme to video
converter includes:

a) a facial animator creating a wireframe image based on the
synchronized plurality of phonemes and the plurality of facial mimics;
and

b) a visual decoder being coupled to the demultiplexer and the facial
animator, and rendering the video image based on the wireframe image.